rank order
Infinity Search: Approximate Vector Search with Projections on q-Metric Spaces
Pariente, Antonio, Hounie, Ignacio, Segarra, Santiago, Ribeiro, Alejandro
Despite the ubiquity of vector search applications, prevailing search algorithms overlook the metric structure of vector embeddings, treating it as a constraint rather than exploiting its underlying properties. In this paper, we demonstrate that in $q$-metric spaces, metric trees can leverage a stronger version of the triangle inequality to reduce comparisons for exact search. Notably, as $q$ approaches infinity, the search complexity becomes logarithmic. Therefore, we propose a novel projection method that embeds vector datasets with arbitrary dissimilarity measures into $q$-metric spaces while preserving the nearest neighbor. We propose to learn an approximation of this projection to efficiently transform query points to a space where Euclidean distances satisfy the desired properties. Our experimental results with text and image vector embeddings show that learning $q$-metric approximations enables classic metric tree algorithms -- which typically underperform with high-dimensional data -- to achieve competitive performance against state-of-the-art search methods.
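As a rough illustration of the "stronger version of the triangle inequality" mentioned above, the sketch below checks the usual q-triangle inequality, d(x,z)^q <= d(x,y)^q + d(y,z)^q, on a small distance matrix, with q = infinity taken as d(x,z) <= max(d(x,y), d(y,z)). The abstract does not spell out this definition, so treat it as an assumption. The function name `satisfies_q_triangle` and the two toy matrices are hypothetical and not taken from the paper.

```python
from itertools import permutations

def satisfies_q_triangle(dist, q, tol=1e-9):
    """Return True if dist obeys d(x,z)^q <= d(x,y)^q + d(y,z)^q for all triples.

    dist is a symmetric matrix given as a list of lists; q = float("inf")
    is interpreted as the max-based (ultrametric) inequality.
    """
    n = len(dist)
    for x, y, z in permutations(range(n), 3):
        if q == float("inf"):
            bound = max(dist[x][y], dist[y][z])
        else:
            bound = (dist[x][y] ** q + dist[y][z] ** q) ** (1.0 / q)
        if dist[x][z] > bound + tol:
            return False
    return True

# Toy data (assumed): three points on a line at 0, 1, 2 ...
line = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
# ... and a small matrix in which the two largest sides of every triangle are equal.
ultra = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]

print(satisfies_q_triangle(line, 1), satisfies_q_triangle(line, 2))
print(satisfies_q_triangle(ultra, float("inf")))
```

The collinear points pass for q = 1 but fail for q = 2, which shows why a dataset generally has to be projected before the stronger inequality holds.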
Causal Explainability of Machine Learning in Heart Failure Prediction from Electronic Health Records
Hou, Yina, Rabbani, Shourav B., Hong, Liang, Diawara, Norou, Samad, Manar D.
The importance of clinical variables in the prognosis of the disease is explained using statistical correlation or machine learning (ML). However, the predictive importance of these variables may not represent their causal relationships with diseases. This paper uses clinical variables from a heart failure (HF) patient cohort to investigate the causal explainability of important variables obtained in statistical and ML contexts. Due to inherent regression modeling, popular causal discovery methods strictly assume that the cause and effect variables are numerical and continuous. This paper proposes a new computational framework to enable causal structure discovery (CSD) and score the causal strength of mixed-type (categorical, numerical, binary) clinical variables for binary disease outcomes. In HF classification, we investigate the association between the importance rank order of three feature types: correlated features, features important for ML predictions, and causal features. Our results demonstrate that CSD modeling for nonlinear causal relationships is more meaningful than its linear counterparts. Feature importance obtained from nonlinear classifiers (e.g., gradient-boosting trees) strongly correlates with the causal strength of variables without differentiating cause and effect variables. Correlated variables can be causal for HF, but they are rarely identified as effect variables. These results can be used to add the causal explanation of variables important for ML-based prediction modeling.
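One common way to quantify the association between two importance rank orders is Spearman's rank correlation. The abstract does not say which measure the authors use, so the sketch below is only an assumed stand-in, and the score lists are made-up illustrative numbers rather than results from the paper.

```python
def ranks(values):
    """Rank 1 goes to the largest value; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(a, b):
    """Spearman rank correlation for two equal-length, tie-free score lists."""
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores for the same four features under two criteria.
ml_importance = [0.40, 0.30, 0.20, 0.10]
causal_strength = [0.90, 0.50, 0.70, 0.10]
print(spearman(ml_importance, causal_strength))
```

A value near 1 means the two criteria order the features almost identically; a value near -1 means the orders are reversed.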
- North America > United States > Tennessee > Davidson County > Nashville (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Netherlands (0.04)
- Research Report > New Finding (0.90)
- Research Report > Experimental Study (0.73)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
- Health & Medicine > Therapeutic Area > Endocrinology > Diabetes (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.93)
Distillation Enhanced Generative Retrieval
Li, Yongqi, Zhang, Zhen, Wang, Wenjie, Nie, Liqiang, Li, Wenjie, Chua, Tat-Seng
Generative retrieval is a promising new paradigm in text retrieval that generates identifier strings of relevant passages as the retrieval target. This paradigm leverages powerful generative language models, distinct from traditional sparse or dense retrieval methods. In this work, we identify a viable direction to further enhance generative retrieval via distillation and propose a feasible framework, named DGR. DGR utilizes sophisticated ranking models, such as the cross-encoder, in a teacher role to supply a passage rank list, which captures the varying relevance degrees of passages instead of binary hard labels; subsequently, DGR employs a specially designed distilled RankNet loss to optimize the generative retrieval model, considering the passage rank order provided by the teacher model as labels. This framework only requires an additional distillation step to enhance current generative retrieval systems and does not add any burden to the inference stage. We conduct experiments on four public datasets, and the results indicate that DGR achieves state-of-the-art performance among the generative retrieval methods. Additionally, DGR demonstrates exceptional robustness and generalizability with various teacher models and distillation losses.
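To make the idea of using a teacher's rank order as labels concrete, here is a minimal sketch of the standard pairwise RankNet loss, where every pair the teacher orders as "a above b" contributes log(1 + exp(-(s_a - s_b))) for student scores s. This is the textbook RankNet form, not the specially designed distilled variant from the paper, whose details the abstract does not give. The passage ids and scores are hypothetical.

```python
import math

def ranknet_loss(scores, teacher_order):
    """Mean pairwise RankNet loss.

    scores: dict mapping passage id -> student model score.
    teacher_order: passage ids sorted by the teacher, most relevant first.
    """
    total, pairs = 0.0, 0
    for a in range(len(teacher_order)):
        for b in range(a + 1, len(teacher_order)):
            # The teacher ranks teacher_order[a] above teacher_order[b].
            diff = scores[teacher_order[a]] - scores[teacher_order[b]]
            total += math.log1p(math.exp(-diff))
            pairs += 1
    return total / pairs if pairs else 0.0

# Hypothetical student scores for three passages.
student = {"p1": 2.0, "p2": 1.0, "p3": 0.0}
print(ranknet_loss(student, ["p1", "p2", "p3"]))  # student agrees with teacher
print(ranknet_loss(student, ["p3", "p2", "p1"]))  # student disagrees
```

The loss is smaller when the student's scores follow the teacher's order, which is the signal a distillation step would optimize.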
- Asia > Singapore (0.04)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- Asia > China > Hong Kong (0.04)
Temporal Sequencing of Documents
Gervers, Michael, Tilahun, Gelila
We outline an unsupervised method for temporal rank ordering of sets of historical documents, namely American State of the Union Addresses and DEEDS, a corpus of medieval English property transfer documents. Our method relies upon effectively capturing the gradual change in word usage via a bandwidth estimate for the non-parametric Generalized Linear Models (Fan, Heckman, and Wand, 1995). The number of possible rank orders needed to search through possible cost functions related to the bandwidth can be quite large, even for a small set of documents. We tackle this problem of combinatorial optimization using the Simulated Annealing algorithm, which allows us to obtain the optimal document temporal orders. Our rank ordering method significantly improved the temporal sequencing of both corpora compared to a randomly sequenced baseline. This unsupervised approach should enable the temporal ordering of undated document sets.
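The combinatorial search over rank orders can be sketched with a generic simulated annealing loop that proposes swaps of two positions. The cost function here, `adjacent_gap`, is a hypothetical stand-in (the sum of differences between neighbouring items) and not the bandwidth-based cost from the paper. The cooling schedule, the step count and the toy "years" are likewise assumptions.

```python
import math
import random

def anneal_order(items, cost, steps=5000, t0=1.0, cooling=0.999, seed=0):
    """Search for a low-cost ordering of items by simulated annealing."""
    rng = random.Random(seed)
    order = list(items)
    rng.shuffle(order)
    current = cost(order)
    best, best_cost = order[:], current
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]  # propose a swap
        candidate = cost(order)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if candidate <= current or rng.random() < math.exp((current - candidate) / temp):
            current = candidate
            if current < best_cost:
                best, best_cost = order[:], current
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
        temp *= cooling
    return best, best_cost

def adjacent_gap(order):
    """Toy cost (assumed): total absolute difference between neighbours."""
    return sum(abs(a - b) for a, b in zip(order, order[1:]))

years = [1801, 1790, 1850, 1822, 1815, 1840]  # made-up document features
best, best_cost = anneal_order(years, adjacent_gap)
print(best, best_cost)
```

A symmetric cost like this one cannot tell a sequence from its reverse, so a real application would need some extra information to fix the direction of time.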
- North America > Canada > Ontario > Toronto (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > Serbia > Central Serbia > Belgrade (0.04)
- Law (1.00)
- Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Cloud Collectives: Towards Cloud-aware Collectives for ML Workloads with Rank Reordering
Luo, Liang, Nelson, Jacob, Krishnamurthy, Arvind, Ceze, Luis
ML workloads are becoming increasingly popular in the cloud. Good cloud training performance is contingent on efficient parameter exchange among VMs. We find that Collectives, the widely used distributed communication algorithms, cannot perform optimally out of the box due to the hierarchical topology of datacenter networks and the multi-tenancy nature of the cloud environment. In this paper, we present Cloud Collectives, a prototype that accelerates collectives by reordering the ranks of participating VMs such that the communication pattern dictated by the selected collectives operation best exploits the locality in the network. Cloud Collectives is non-intrusive, requires neither code changes nor a rebuild of an existing application, and runs without support from cloud providers. Our preliminary application of Cloud Collectives on allreduce operations in public clouds results in a speedup of up to 3.7x in multiple microbenchmarks and 1.3x in real-world workloads of distributed training of deep neural networks and gradient boosted decision trees using state-of-the-art frameworks.
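A toy way to see why rank order matters for a collective: in a ring-style allreduce each rank talks to its successor, so the number of neighbouring pairs that cross a locality boundary depends on how ranks are assigned. The sketch below assumes a simple two-level model with known rack labels. The abstract does not describe how the prototype infers locality or chooses an order, so `rack_of`, `ring_cost` and `reorder_by_rack` are purely hypothetical.

```python
def ring_cost(order, rack_of):
    """Count ring neighbours (cyclic) that sit in different racks."""
    n = len(order)
    return sum(rack_of[order[i]] != rack_of[order[(i + 1) % n]] for i in range(n))

def reorder_by_rack(vms, rack_of):
    """Assumed heuristic: give consecutive ranks to VMs that share a rack."""
    return sorted(vms, key=lambda vm: (rack_of[vm], vm))

# Hypothetical placement of four VMs across two racks.
rack_of = {"vm0": "A", "vm1": "B", "vm2": "A", "vm3": "B"}
default_order = ["vm0", "vm1", "vm2", "vm3"]
better_order = reorder_by_rack(default_order, rack_of)

print(default_order, ring_cost(default_order, rack_of))
print(better_order, ring_cost(better_order, rack_of))
```

In this toy placement the default order crosses racks on every hop, while the reordered ring crosses only twice.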
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Illinois > Champaign County > Champaign (0.04)
Nearest Neighbor Search-Based Bitwise Source Separation Using Discriminant Winner-Take-All Hashing
We propose an iteration-free source separation algorithm based on Winner-Take-All (WTA) hash codes, which is a faster, yet accurate alternative to a complex machine learning model for single-channel source separation in a resource-constrained environment. We first generate random permutations with WTA hashing to encode the shape of the multidimensional audio spectrum to a reduced bitstring representation. A nearest neighbor search on the hash codes of an incoming noisy spectrum as the query string results in the closest matches among the hashed mixture spectra. Using the indices of the matching frames, we obtain the corresponding ideal binary mask vectors for denoising. Since both the training data and the search operation are bitwise, the procedure can be done efficiently in hardware implementations. Experimental results show that the WTA hash codes are discriminant and provide an affordable dictionary search mechanism that leads to a competent performance compared to a comprehensive model and oracle masking.
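The encoding and lookup steps described above can be sketched with plain Winner-Take-All hashing: for each random permutation, look at the first k permuted entries and record the position of the largest one. This is a generic WTA sketch under assumed parameters, not the discriminant variant evaluated in the paper. The function names, the toy "spectrum" and the dictionary are hypothetical.

```python
import random

def make_permutations(dim, n_hashes, seed=0):
    """Draw n_hashes random permutations of the indices 0..dim-1."""
    rng = random.Random(seed)
    perms = []
    for _ in range(n_hashes):
        p = list(range(dim))
        rng.shuffle(p)
        perms.append(p)
    return perms

def wta_hash(vec, perms, k):
    """For each permutation, return the argmax position among the first k entries."""
    code = []
    for p in perms:
        window = [vec[i] for i in p[:k]]
        code.append(max(range(k), key=window.__getitem__))
    return code

def hamming(a, b):
    """Number of positions at which two codes differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest(query_code, dictionary_codes):
    """Index of the dictionary code closest to the query in Hamming distance."""
    return min(range(len(dictionary_codes)),
               key=lambda i: hamming(query_code, dictionary_codes[i]))

perms = make_permutations(dim=8, n_hashes=16)
spectrum = [5, 1, 9, 3, 7, 2, 8, 4]                      # toy magnitude frame
dictionary = [wta_hash([-x for x in spectrum], perms, 2),
              wta_hash(spectrum, perms, 2)]
louder = [2 * x + 1 for x in spectrum]                   # same shape, rescaled
print(nearest(wta_hash(louder, perms, 2), dictionary))
```

Because each code depends only on the relative order of values, a rescaled copy of a frame hashes to the same code, which is what lets the hash capture the shape of a spectrum. With k = 2 each hash is a single bit.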
Model selection by minimum description length: Lower-bound sample sizes for the Fisher information approximation
Heck, Daniel W., Moshagen, Morten, Erdfelder, Edgar
For the published version of the article, see: Heck, D. W., Moshagen, M., & Erdfelder, E. (2014). Correspondence concerning this article should be addressed to Daniel W. Heck, Department of Psychology, School of Social Sciences, University of Mannheim, Schloss EO 254, D-68131 Mannheim, Germany. The Fisher information approximation (FIA) is an implementation of the minimum description length principle for model selection. Unlike information criteria such as AIC or BIC, it has the advantage of taking the functional form of a model into account. Unfortunately, FIA can be misleading in finite samples, resulting in an inversion of the correct rank order of complexity terms for competing models in the worst case.
- Europe > Germany (0.25)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Semantic Explanations of Predictions
The main objective of explanations is to transmit knowledge to humans. This work proposes to construct informative explanations for predictions made by machine learning models. Motivated by observations from the social sciences, our approach selects data points from the training sample that exhibit special characteristics crucial for explanation, for instance, ones contrastive to the classification prediction and ones representative of the models. Subsequently, semantic concepts are derived from the selected data points through the use of domain ontologies. These concepts are filtered and ranked to produce informative explanations that improve human understanding. The main features of our approach are that (1) knowledge about explanations is captured in the form of ontological concepts, (2) explanations include contrastive evidence in addition to normal evidence, and (3) explanations are user-relevant.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Health & Medicine (0.68)
- Law > Statutes (0.46)
- Information Technology > Security & Privacy (0.46)